title: ‘Data Preprocessing’ output: html_document: code_folding: ‘hide’ —


Raw data

Dataset downloaded from mgandal’s github repository.

Load and annotate data

# Load csvs
datExpr = read.csv('./../Data/RNAseq_ASD_datExpr.csv', row.names=1)
datMeta = read.csv('./../Data/RNAseq_ASD_datMeta.csv')

# 1. Group brain regions by lobes
# 2. Remove '/' from Batch variable: (It is recommended (but not required) to use only letters, numbers, 
#    and delimiters '_' or '.', in levels of factors as these are safe characters for column names in R
# 3. Transform Diagnosis into a factor variable
datMeta = datMeta %>% mutate(Brain_Region = as.factor(Region)) %>% 
                      mutate(Brain_lobe = ifelse(Brain_Region %in% c('BA4_6', 'BA9', 'BA24', 'BA44_45'), 'Frontal',
                                          ifelse(Brain_Region %in% c('BA3_1_2_5', 'BA7'), 'Parietal',
                                          ifelse(Brain_Region %in% c('BA38', 'BA39_40', 'BA20_37', 'BA41_42_22'), 'Temporal',
                                          'Occipital')))) %>%
                      mutate(Batch = as.factor(gsub('/', '.', RNAExtractionBatch)), 
                             Diagnosis = factor(Diagnosis_, levels=c('CTL','ASD'))) %>% 
                      dplyr::select(-Diagnosis_)


# GO Neuronal annotations: regex 'neuron' in GO functional annotations and label the genes that make a match as neuronal
GO_annotations = read.csv('./../Data/genes_GO_annotations.csv')
GO_neuronal = GO_annotations %>% filter(grepl('neuron', go_term)) %>% 
              mutate('ID'=as.character(ensembl_gene_id)) %>% 
              dplyr::select(-ensembl_gene_id) %>% distinct(ID) %>%
              mutate('Neuronal'=1)


# SFARI Genes
SFARI_genes = read_csv('./../../../SFARI/Data/SFARI_genes_08-29-2019_w_ensembl_IDs.csv')
SFARI_genes = SFARI_genes[!duplicated(SFARI_genes$ID) & !is.na(SFARI_genes$ID),]

# NCBI biotype annotation
NCBI_biotype = read.csv('./../../../NCBI/Data/gene_biotype_info.csv') %>% 
               rename(Ensembl_gene_identifier='ensembl_gene_id', type_of_gene='gene_biotype', Symbol='hgnc_symbol') %>% 
               mutate(gene_biotype = ifelse(gene_biotype=='protein-coding','protein_coding',gene_biotype))

rm(GO_annotations)

Check sample composition

Data description taken from the dataset’s synapse entry: RNAseq data was generated from 88 postmortem cortex brain samples from subjects with ASD (53 samples from 24 subjects) and non-psychiatric controls (35 samples from 17 subjects), across four cortical regions encompassing all major cortical lobes – frontal, temporal, parietal, and occipital. Brain samples were obtained from the Harvard Brain Bank as part of the Autism Tissue Project (ATP).

cat(paste0('Dataset includes ', nrow(datExpr), ' genes from ', ncol(datExpr), ' samples belonging to ', 
             length(unique(datMeta$Subject_ID)), ' different subjects.'))
## Dataset includes 63682 genes from 88 samples belonging to 41 different subjects.

Counts distribution: More than half of the counts are zero and most of the counts are relatively low, but there are some very high outliers

count_distr = summary(melt(datExpr))[,2]
for(i in 1:6){
  print(count_distr[i])
}
##                      
## "Min.   :       0  " 
##                      
## "1st Qu.:       0  " 
##                      
## "Median :       0  " 
##                      
## "Mean   :     564  " 
##                      
## "3rd Qu.:      27  " 
##                      
## "Max.   :27183314  "
rm(count_distr, i)


Diagnosis distribution by Sample: There are more ASD samples than controls

table_info = datMeta %>% apply_labels(Diagnosis = 'Diagnosis', Brain_lobe = 'Brain Lobe', Batch = 'Batch', Sex = 'Gender')
cro(table_info$Diagnosis)
 #Total 
 Diagnosis 
   CTL  35
   ASD  53
   #Total cases  88


Diagnosis distribution by Subject: There are more ASD patients than controls

cro(table_info$Diagnosis[!duplicated(table_info$Subject_ID)])
 #Total 
 Diagnosis 
   CTL  17
   ASD  24
   #Total cases  41

Brain region distribution: All regions seem to be balanced

cro(table_info$Brain_lobe)
 #Total 
 Brain Lobe 
   Frontal  22
   Occipital  23
   Parietal  23
   Temporal  20
   #Total cases  88


Diagnosis and brain region seem to be balanced except for the frontal lobe, where there are more control samples than ASD ones

cro(table_info$Diagnosis, list(table_info$Brain_lobe,total()))
 Brain Lobe     #Total 
 Frontal   Occipital   Parietal   Temporal   
 Diagnosis 
   CTL  13 8 8 6   35
   ASD  9 15 15 14   53
   #Total cases  22 23 23 20   88


Sex distribution: There are many more Male samples than Female ones

cro(table_info$Sex)
 #Total 
 Gender 
   F  15
   M  73
   #Total cases  88


Diagnosis and Gender seem to be balanced

cro(table_info$Diagnosis, list(table_info$Sex, total()))
 Gender     #Total 
 F   M   
 Diagnosis 
   CTL  6 29   35
   ASD  9 44   53
   #Total cases  15 73   88


Age distribution: Subjects between 2 and 60 years old with a mean close to 30

summary(datMeta$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   17.00   28.00   29.74   41.75   60.00


Annotate genes with BioMart information

I was originally running this with the feb2014 version of BioMart because that’s the one that Gandal used (and it finds all of the Ensembl IDs, which other versions don’t), but it has some outdated biotype annotations, to fix this I’ll obtain all the information except the biotype label from BioMart in the same way as it had been done before, and then I’ll add the most current biotype label using information from NCBI’s website and then from BioMart in the following way:

1.Use BioMart to run a query with the original feb2014 version using the Ensembl IDs as keys to obtain all the information except the biotype labels

  1. Annotate genes with Biotype labels:

2.1 Use the NCBI annotations downloaded from NCBI’s website and processed in NCBI/RMarkdowns/20_02_07_clean_data.html (there is information for only 26K genes, so some genes will remain unlabelled)

2.2 Use the current version (jan2020) to obtain the biotype annotations using the Ensembl ID as keys (some genes don’t return a match)

2.3 For the genes that didn’t return a match, use the current version (jan2020) to obtain the biotype annotations using the gene name as keys (17 genes return multiple labels)

2.4 For the genes that returned multiple labels, use the feb2014 version with the Ensembl IDs as keys

labels_source = data.frame(data.frame('source' = c('NCBI', 'BioMart2020_byID', 'BioMart2020_byGene', 'BioMart2014'),
                                      'n_matches' = rep(0,4)))

########################################################################################
# 1. Query archive version

getinfo = c('ensembl_gene_id','external_gene_id','chromosome_name','start_position',
            'end_position','strand')
mart = useMart(biomart='ENSEMBL_MART_ENSEMBL', dataset='hsapiens_gene_ensembl', host='feb2014.archive.ensembl.org')
datGenes = getBM(attributes=getinfo, filters=c('ensembl_gene_id'), values=rownames(datExpr), mart=mart) %>% 
           rename(external_gene_id = 'hgnc_symbol')
## Batch submitting query [-------------------------------] 2% eta: 24s
## Batch submitting query [>------------------------------] 2% eta: 25s
## Batch submitting query [>------------------------------] 3% eta: 26s
## Batch submitting query [>------------------------------] 4% eta: 26s
## Batch submitting query [>------------------------------] 5% eta: 27s
## Batch submitting query [=>-----------------------------] 5% eta: 28s
## Batch submitting query [=>-----------------------------] 6% eta: 28s
## Batch submitting query [=>-----------------------------] 7% eta: 35s
## Batch submitting query [=>-----------------------------] 8% eta: 34s
## Batch submitting query [==>----------------------------] 9% eta: 34s
## Batch submitting query [==>----------------------------] 9% eta: 36s
## Batch submitting query [==>----------------------------] 10% eta: 35s
## Batch submitting query [==>----------------------------] 11% eta: 35s
## Batch submitting query [===>---------------------------] 12% eta: 34s
## Batch submitting query [===>---------------------------] 12% eta: 33s
## Batch submitting query [===>---------------------------] 13% eta: 32s
## Batch submitting query [===>---------------------------] 14% eta: 32s
## Batch submitting query [====>--------------------------] 15% eta: 31s
## Batch submitting query [====>--------------------------] 16% eta: 30s
## Batch submitting query [====>--------------------------] 17% eta: 29s
## Batch submitting query [=====>-------------------------] 18% eta: 29s
## Batch submitting query [=====>-------------------------] 19% eta: 28s
## Batch submitting query [=====>-------------------------] 20% eta: 28s
## Batch submitting query [======>------------------------] 21% eta: 27s
## Batch submitting query [======>------------------------] 22% eta: 27s
## Batch submitting query [======>------------------------] 23% eta: 26s
## Batch submitting query [=======>-----------------------] 24% eta: 25s
## Batch submitting query [=======>-----------------------] 25% eta: 25s
## Batch submitting query [=======>-----------------------] 26% eta: 25s
## Batch submitting query [=======>-----------------------] 27% eta: 24s
## Batch submitting query [========>----------------------] 28% eta: 24s
## Batch submitting query [========>----------------------] 29% eta: 23s
## Batch submitting query [========>----------------------] 30% eta: 23s
## Batch submitting query [=========>---------------------] 31% eta: 22s
## Batch submitting query [=========>---------------------] 32% eta: 22s
## Batch submitting query [=========>---------------------] 33% eta: 22s
## Batch submitting query [=========>---------------------] 34% eta: 21s
## Batch submitting query [==========>--------------------] 34% eta: 21s
## Batch submitting query [==========>--------------------] 35% eta: 21s
## Batch submitting query [==========>--------------------] 36% eta: 20s
## Batch submitting query [==========>--------------------] 37% eta: 20s
## Batch submitting query [===========>-------------------] 38% eta: 20s
## Batch submitting query [===========>-------------------] 38% eta: 19s
## Batch submitting query [===========>-------------------] 39% eta: 19s
## Batch submitting query [===========>-------------------] 40% eta: 19s
## Batch submitting query [============>------------------] 41% eta: 19s
## Batch submitting query [============>------------------] 41% eta: 18s
## Batch submitting query [============>------------------] 42% eta: 18s
## Batch submitting query [============>------------------] 43% eta: 18s
## Batch submitting query [=============>-----------------] 44% eta: 18s
## Batch submitting query [=============>-----------------] 45% eta: 17s
## Batch submitting query [=============>-----------------] 46% eta: 17s
## Batch submitting query [==============>----------------] 47% eta: 16s
## Batch submitting query [==============>----------------] 48% eta: 16s
## Batch submitting query [==============>----------------] 49% eta: 16s
## Batch submitting query [===============>---------------] 50% eta: 15s
## Batch submitting query [===============>---------------] 51% eta: 15s
## Batch submitting query [===============>---------------] 52% eta: 15s
## Batch submitting query [===============>---------------] 52% eta: 14s
## Batch submitting query [===============>---------------] 53% eta: 14s
## Batch submitting query [================>--------------] 54% eta: 14s
## Batch submitting query [================>--------------] 55% eta: 14s
## Batch submitting query [================>--------------] 55% eta: 13s
## Batch submitting query [================>--------------] 56% eta: 13s
## Batch submitting query [=================>-------------] 57% eta: 13s
## Batch submitting query [=================>-------------] 58% eta: 13s
## Batch submitting query [=================>-------------] 59% eta: 12s
## Batch submitting query [==================>------------] 60% eta: 12s
## Batch submitting query [==================>------------] 61% eta: 12s
## Batch submitting query [==================>------------] 62% eta: 11s
## Batch submitting query [===================>-----------] 63% eta: 11s
## Batch submitting query [===================>-----------] 64% eta: 11s
## Batch submitting query [===================>-----------] 65% eta: 11s
## Batch submitting query [===================>-----------] 66% eta: 10s
## Batch submitting query [====================>----------] 66% eta: 10s
## Batch submitting query [====================>----------] 67% eta: 10s
## Batch submitting query [====================>----------] 68% eta: 10s
## Batch submitting query [====================>----------] 69% eta: 9s
## Batch submitting query [=====================>---------] 70% eta: 9s
## Batch submitting query [=====================>---------] 71% eta: 9s
## Batch submitting query [=====================>---------] 72% eta: 8s
## Batch submitting query [======================>--------] 73% eta: 8s
## Batch submitting query [======================>--------] 74% eta: 8s
## Batch submitting query [======================>--------] 75% eta: 7s
## Batch submitting query [======================>--------] 76% eta: 7s
## Batch submitting query [=======================>-------] 77% eta: 7s
## Batch submitting query [=======================>-------] 78% eta: 6s
## Batch submitting query [=======================>-------] 79% eta: 6s
## Batch submitting query [========================>------] 80% eta: 6s
## Batch submitting query [========================>------] 81% eta: 6s
## Batch submitting query [========================>------] 82% eta: 5s
## Batch submitting query [=========================>-----] 83% eta: 5s
## Batch submitting query [=========================>-----] 84% eta: 5s
## Batch submitting query [=========================>-----] 85% eta: 4s
## Batch submitting query [==========================>----] 86% eta: 4s
## Batch submitting query [==========================>----] 87% eta: 4s
## Batch submitting query [==========================>----] 88% eta: 4s
## Batch submitting query [==========================>----] 88% eta: 3s
## Batch submitting query [===========================>---] 89% eta: 3s
## Batch submitting query [===========================>---] 90% eta: 3s
## Batch submitting query [===========================>---] 91% eta: 3s
## Batch submitting query [============================>--] 92% eta: 2s
## Batch submitting query [============================>--] 93% eta: 2s
## Batch submitting query [============================>--] 94% eta: 2s
## Batch submitting query [============================>--] 95% eta: 2s
## Batch submitting query [=============================>-] 95% eta: 1s
## Batch submitting query [=============================>-] 96% eta: 1s
## Batch submitting query [=============================>-] 97% eta: 1s
## Batch submitting query [=============================>-] 98% eta: 1s Batch
## submitting query [==============================>] 98% eta: 0s Batch submitting
## query [==============================>] 99% eta: 0s Batch submitting query
## [===============================] 100% eta: 0s
datGenes$length = datGenes$end_position-datGenes$start_position

cat(paste0('1. ', sum(is.na(datGenes$start_position)), '/', nrow(datGenes),
             ' Ensembl IDs weren\'t found in the feb2014 version of BioMart'))
## 1. 0/63677 Ensembl IDs weren't found in the feb2014 version of BioMart
########################################################################################
########################################################################################
# 2. Get Biotype Labels

cat('2. Add biotype information')
## 2. Add biotype information
########################################################################################
# 2.1 Add NCBI annotations
datGenes = datGenes %>% left_join(NCBI_biotype, by=c('ensembl_gene_id','hgnc_symbol'))

cat(paste0('2.1 ' , sum(is.na(datGenes$gene_biotype)), '/', nrow(datGenes),
             ' Ensembl IDs weren\'t found in the NCBI database'))
## 2.1 42904/63677 Ensembl IDs weren't found in the NCBI database
labels_source$n_matches[1] = sum(!is.na(datGenes$gene_biotype))

########################################################################################
# 2.2 Query current BioMart version for gene_biotype using Ensembl ID as key

getinfo = c('ensembl_gene_id','gene_biotype')
mart = useMart(biomart='ENSEMBL_MART_ENSEMBL', dataset='hsapiens_gene_ensembl', host='jan2020.archive.ensembl.org')
datGenes_biotype = getBM(attributes=getinfo, filters=c('ensembl_gene_id'), mart=mart, 
                         values=datGenes$ensembl_gene_id[is.na(datGenes$gene_biotype)])
## Batch submitting query [>------------------------------] 2% eta: 23s
## Batch submitting query [>------------------------------] 3% eta: 30s
## Batch submitting query [>------------------------------] 5% eta: 26s
## Batch submitting query [=>-----------------------------] 6% eta: 24s
## Batch submitting query [=>-----------------------------] 7% eta: 22s
## Batch submitting query [==>----------------------------] 8% eta: 20s
## Batch submitting query [==>----------------------------] 9% eta: 19s
## Batch submitting query [==>----------------------------] 10% eta: 18s
## Batch submitting query [===>---------------------------] 12% eta: 17s
## Batch submitting query [===>---------------------------] 13% eta: 16s
## Batch submitting query [===>---------------------------] 14% eta: 16s
## Batch submitting query [====>--------------------------] 15% eta: 15s
## Batch submitting query [====>--------------------------] 16% eta: 38s
## Batch submitting query [====>--------------------------] 17% eta: 36s
## Batch submitting query [=====>-------------------------] 19% eta: 1m
## Batch submitting query [=====>-------------------------] 20% eta: 1m
## Batch submitting query [=====>-------------------------] 21% eta: 1m
## Batch submitting query [======>------------------------] 22% eta: 49s
## Batch submitting query [======>------------------------] 23% eta: 46s
## Batch submitting query [=======>-----------------------] 24% eta: 44s
## Batch submitting query [=======>-----------------------] 26% eta: 42s
## Batch submitting query [=======>-----------------------] 27% eta: 40s
## Batch submitting query [========>----------------------] 28% eta: 38s
## Batch submitting query [========>----------------------] 29% eta: 40s
## Batch submitting query [========>----------------------] 30% eta: 1m
## Batch submitting query [=========>---------------------] 31% eta: 1m
## Batch submitting query [=========>---------------------] 33% eta: 1m
## Batch submitting query [=========>---------------------] 34% eta: 1m
## Batch submitting query [==========>--------------------] 35% eta: 49s
## Batch submitting query [==========>--------------------] 36% eta: 1m
## Batch submitting query [===========>-------------------] 37% eta: 48s
## Batch submitting query [===========>-------------------] 38% eta: 46s
## Batch submitting query [===========>-------------------] 40% eta: 48s
## Batch submitting query [============>------------------] 41% eta: 46s
## Batch submitting query [============>------------------] 42% eta: 47s
## Batch submitting query [============>------------------] 43% eta: 45s
## Batch submitting query [=============>-----------------] 44% eta: 47s
## Batch submitting query [=============>-----------------] 45% eta: 45s
## Batch submitting query [=============>-----------------] 47% eta: 47s
## Batch submitting query [==============>----------------] 48% eta: 45s
## Batch submitting query [==============>----------------] 49% eta: 43s
## Batch submitting query [===============>---------------] 50% eta: 43s
## Batch submitting query [===============>---------------] 51% eta: 42s
## Batch submitting query [===============>---------------] 52% eta: 40s
## Batch submitting query [================>--------------] 53% eta: 39s
## Batch submitting query [================>--------------] 55% eta: 39s
## Batch submitting query [================>--------------] 56% eta: 37s
## Batch submitting query [=================>-------------] 57% eta: 37s
## Batch submitting query [=================>-------------] 58% eta: 36s
## Batch submitting query [=================>-------------] 59% eta: 34s
## Batch submitting query [==================>------------] 60% eta: 33s
## Batch submitting query [==================>------------] 62% eta: 33s
## Batch submitting query [==================>------------] 63% eta: 31s
## Batch submitting query [===================>-----------] 64% eta: 31s
## Batch submitting query [===================>-----------] 65% eta: 31s
## Batch submitting query [====================>----------] 66% eta: 30s
## Batch submitting query [====================>----------] 67% eta: 30s
## Batch submitting query [====================>----------] 69% eta: 28s
## Batch submitting query [=====================>---------] 70% eta: 27s
## Batch submitting query [=====================>---------] 71% eta: 25s
## Batch submitting query [=====================>---------] 72% eta: 25s
## Batch submitting query [======================>--------] 73% eta: 23s
## Batch submitting query [======================>--------] 74% eta: 22s
## Batch submitting query [======================>--------] 76% eta: 22s
## Batch submitting query [=======================>-------] 77% eta: 20s
## Batch submitting query [=======================>-------] 78% eta: 20s
## Batch submitting query [========================>------] 79% eta: 19s
## Batch submitting query [========================>------] 80% eta: 18s
## Batch submitting query [========================>------] 81% eta: 17s
## Batch submitting query [=========================>-----] 83% eta: 16s
## Batch submitting query [=========================>-----] 84% eta: 14s
## Batch submitting query [=========================>-----] 85% eta: 14s
## Batch submitting query [==========================>----] 86% eta: 13s
## Batch submitting query [==========================>----] 87% eta: 13s
## Batch submitting query [==========================>----] 88% eta: 11s
## Batch submitting query [===========================>---] 90% eta: 10s
## Batch submitting query [===========================>---] 91% eta: 9s
## Batch submitting query [===========================>---] 92% eta: 8s
## Batch submitting query [============================>--] 93% eta: 7s
## Batch submitting query [============================>--] 94% eta: 5s
## Batch submitting query [=============================>-] 95% eta: 4s
## Batch submitting query [=============================>-] 97% eta: 3s Batch
## submitting query [=============================>-] 98% eta: 2s Batch submitting
## query [==============================>] 99% eta: 1s Batch submitting query
## [===============================] 100% eta: 0s
cat(paste0('2.2 ' , sum(is.na(datGenes$gene_biotype))-nrow(datGenes_biotype), '/', sum(is.na(datGenes$gene_biotype)),
             ' Ensembl IDs weren\'t found in the jan2020 version of BioMart when querying by Ensembl ID'))
## 2.2 9099/42904 Ensembl IDs weren't found in the jan2020 version of BioMart when querying by Ensembl ID
# Add new gene_biotype info to datGenes
datGenes = datGenes %>% left_join(datGenes_biotype, by='ensembl_gene_id') %>%
           mutate(gene_biotype = coalesce(as.character(gene_biotype.x), gene_biotype.y)) %>%
           dplyr::select(-gene_biotype.x, -gene_biotype.y)

labels_source$n_matches[2] = sum(!is.na(datGenes$gene_biotype)) - labels_source$n_matches[1]

########################################################################################
# 3. Query current BioMart version for gene_biotype using gene symbol as key

missing_genes = unique(datGenes$hgnc_symbol[is.na(datGenes$gene_biotype)])
getinfo = c('hgnc_symbol','gene_biotype')
datGenes_biotype_by_gene = getBM(attributes=getinfo, filters=c('hgnc_symbol'), mart=mart,
                                 values=missing_genes)
## Batch submitting query [===>---------------------------] 12% eta: 5s
## Batch submitting query [=====>-------------------------] 19% eta: 3s
## Batch submitting query [=======>-----------------------] 25% eta: 8s
## Batch submitting query [=========>---------------------] 31% eta: 6s
## Batch submitting query [===========>-------------------] 38% eta: 5s
## Batch submitting query [=============>-----------------] 44% eta: 4s
## Batch submitting query [===============>---------------] 50% eta: 6s
## Batch submitting query [================>--------------] 56% eta: 9s
## Batch submitting query [==================>------------] 62% eta: 7s
## Batch submitting query [====================>----------] 69% eta: 6s
## Batch submitting query [======================>--------] 75% eta: 5s
## Batch submitting query [========================>------] 81% eta: 3s Batch
## submitting query [==========================>----] 88% eta: 2s Batch submitting
## query [============================>--] 94% eta: 1s Batch submitting query
## [===============================] 100% eta: 0s
cat(paste0('2.3 ', length(missing_genes)-length(unique(datGenes_biotype_by_gene$hgnc_symbol)),'/',length(missing_genes),
             ' genes weren\'t found in the current BioMart version when querying by gene name'))
## 2.3 5712/7866 genes weren't found in the current BioMart version when querying by gene name
dups = unique(datGenes_biotype_by_gene$hgnc_symbol[duplicated(datGenes_biotype_by_gene$hgnc_symbol)])
cat(paste0('    ', length(dups), ' genes returned multiple labels (these won\'t be added)'))
##     17 genes returned multiple labels (these won't be added)
# Update information
datGenes_biotype_by_gene = datGenes_biotype_by_gene %>% filter(!hgnc_symbol %in% dups)
datGenes = datGenes %>% left_join(datGenes_biotype_by_gene, by='hgnc_symbol') %>% 
            mutate(gene_biotype = coalesce(gene_biotype.x, gene_biotype.y)) %>%
            dplyr::select(-gene_biotype.x, -gene_biotype.y)

labels_source$n_matches[3] = sum(!is.na(datGenes$gene_biotype)) - sum(labels_source$n_matches)

########################################################################################
# 4. Query feb2014 BioMart version for the missing biotypes

missing_ensembl_ids = unique(datGenes$ensembl_gene_id[is.na(datGenes$gene_biotype)])

getinfo = c('ensembl_gene_id','gene_biotype')
mart = useMart(biomart='ENSEMBL_MART_ENSEMBL', dataset='hsapiens_gene_ensembl', host='feb2014.archive.ensembl.org')
datGenes_biotype_archive = getBM(attributes=getinfo, filters=c('ensembl_gene_id'), values=missing_ensembl_ids, mart=mart)
## Batch submitting query [===>---------------------------] 14% eta: 1s
## Batch submitting query [======>------------------------] 21% eta: 2s
## Batch submitting query [========>----------------------] 29% eta: 2s
## Batch submitting query [==========>--------------------] 36% eta: 2s
## Batch submitting query [============>------------------] 43% eta: 2s
## Batch submitting query [===============>---------------] 50% eta: 1s
## Batch submitting query [=================>-------------] 57% eta: 2s
## Batch submitting query [===================>-----------] 64% eta: 1s
## Batch submitting query [=====================>---------] 71% eta: 1s
## Batch submitting query [=======================>-------] 79% eta: 1s Batch
## submitting query [==========================>----] 86% eta: 1s Batch submitting
## query [============================>--] 93% eta: 0s Batch submitting query
## [===============================] 100% eta: 0s
cat(paste0('2.4 ', length(missing_ensembl_ids)-nrow(datGenes_biotype_archive),'/',length(missing_ensembl_ids),
             ' genes weren\'t found in the feb2014 BioMart version when querying by Ensembl ID'))
## 2.4 0/6648 genes weren't found in the feb2014 BioMart version when querying by Ensembl ID
# Update information
datGenes = datGenes %>% left_join(datGenes_biotype_archive, by='ensembl_gene_id') %>% 
            mutate(gene_biotype = coalesce(gene_biotype.x, gene_biotype.y)) %>%
            dplyr::select(-gene_biotype.x, -gene_biotype.y)

labels_source$n_matches[4] = sum(!is.na(datGenes$gene_biotype)) - sum(labels_source$n_matches)

########################################################################################
# Plot results

labels_source = labels_source %>% mutate(x = 1, percentage = round(100*n_matches/sum(n_matches),1))

p = labels_source %>% ggplot(aes(x, percentage, fill=source)) + geom_bar(position = 'stack', stat = 'identity') +
    theme_minimal() + coord_flip() + theme(legend.position='bottom', axis.title.y=element_blank(),
    axis.text.y=element_blank(), axis.ticks.y=element_blank())

ggplotly(p + theme(legend.position='none'))
as_ggplot(get_legend(p))

########################################################################################
# Reorder rows to match datExpr
datGenes = datGenes[match(rownames(datExpr), datGenes$ensembl_gene_id),]


rm(getinfo, mart, datGenes_biotype, datGenes_biotype_by_gene, datGenes_biotype_archive,
   dups, missing_ensembl_ids, missing_genes, labels_source, p)

Filtering

Checking how many SFARI genes are in the dataset

df = SFARI_genes %>% dplyr::select(-gene_biotype) %>% inner_join(datGenes, by=c('ID'='ensembl_gene_id'))

cat(paste0('Considering all genes, this dataset contains ', length(unique(df$`gene-symbol`)),
             ' of the ', length(unique(SFARI_genes$`gene-symbol`)), ' SFARI genes\n\n'))
## Considering all genes, this dataset contains 979 of the 980 SFARI genes
cat(paste0('The missing gene is ',
           SFARI_genes$`gene-symbol`[! SFARI_genes$`gene-symbol` %in% df$`gene-symbol`],
           ' with a SFARI score of ',
           SFARI_genes$`gene-score`[! SFARI_genes$`gene-symbol` %in% df$`gene-symbol`]))
## The missing gene is MIR137 with a SFARI score of 3
rm(df)


1. Filter entries that don’t correspond to genes

to_keep = !is.na(datGenes$length)
cat(paste0('Names of the rows removed: ', paste(rownames(datExpr)[!to_keep], collapse=', ')))
## Names of the rows removed: __no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique
datGenes = datGenes[to_keep,]
datExpr = datExpr[to_keep,]
rownames(datGenes) = datGenes$ensembl_gene_id

cat(paste0('Removed ', sum(!to_keep), ' \'genes\', ', sum(to_keep), ' remaining'))
## Removed 5 'genes', 63677 remaining


2. Filter genes that do not encode any protein

cat(paste0(sum(datGenes$gene_biotype=='protein_coding'), '/', nrow(datGenes), ' are protein coding genes' ))
## 22543/63677 are protein coding genes
sort(table(datGenes$gene_biotype), decreasing=TRUE)
## 
##                     protein_coding                             lncRNA 
##                              22543                              12167 
##               processed_pseudogene             unprocessed_pseudogene 
##                              10117                               2547 
##                                  1                              miRNA 
##                               2314                               2276 
##                           misc_RNA                              snRNA 
##                               2178                               2043 
##                         pseudogene                             snoRNA 
##                               1410                               1202 
##                            lincRNA transcribed_unprocessed_pseudogene 
##                                840                                682 
##                    rRNA_pseudogene   transcribed_processed_pseudogene 
##                                500                                441 
##                          antisense                                  3 
##                                380                                331 
##                                  6                    IG_V_pseudogene 
##                                314                                254 
##                          IG_V_gene                          TR_V_gene 
##                                179                                146 
##     transcribed_unitary_pseudogene                          TR_J_gene 
##                                 86                                 81 
##                 unitary_pseudogene               processed_transcript 
##                                 74                                 72 
##                     sense_intronic                          IG_D_gene 
##                                 72                                 64 
##                               rRNA                    TR_V_pseudogene 
##                                 49                                 46 
##                  sense_overlapping                             scaRNA 
##                                 38                                 31 
##             polymorphic_pseudogene                                  7 
##                                 28                                 25 
##                          IG_J_gene                          IG_C_gene 
##                                 24                                 23 
##                            Mt_tRNA                                  4 
##                                 22                                 17 
##                    IG_C_pseudogene                                TEC 
##                                 11                                 11 
##                          TR_C_gene           3prime_overlapping_ncrna 
##                                  8                                  6 
##                    IG_J_pseudogene                           ribozyme 
##                                  6                                  5 
##                    TR_J_pseudogene                          TR_D_gene 
##                                  5                                  3 
##                            Mt_rRNA                                  8 
##                                  2                                  1 
##    translated_processed_pseudogene  translated_unprocessed_pseudogene 
##                                  1                                  1 
##                           vaultRNA 
##                                  1

Most of the genes with low expression levels are not protein-coding

plot_data = data.frame('ID' = rownames(datExpr), 'MeanExpr' = apply(datExpr, 1, mean),
                       'ProteinCoding' = datGenes$gene_biotype=='protein_coding')

ggplotly(plot_data %>% ggplot(aes(log2(MeanExpr+1), fill=ProteinCoding, color=ProteinCoding)) + geom_density(alpha=0.5) + 
         theme_minimal())
rm(plot_data)

We lose 3 genes with a SFARI score

Note: The gene name for Ensembl ID ENSG00000187951 is wrong, it should be AC091057.1 instead of ARHGAP11B, but the biotype is right, so it would still be filtered out

df = SFARI_genes %>% dplyr::select(-gene_biotype) %>% inner_join(datGenes, by=c('ID'='ensembl_gene_id'))

cat(paste0('Filtering protein coding genes, we are left with ', length(unique(df$`gene-symbol`[df$gene_biotype=='protein_coding'])),
             ' SFARI genes'))
## Filtering protein coding genes, we are left with 976 SFARI genes
kable(df %>% filter(! `gene-symbol` %in% df$`gene-symbol`[df$gene_biotype=='protein_coding']) %>% 
      dplyr::select(ID, `gene-symbol`, `gene-score`, gene_biotype, syndromic, `number-of-reports`), caption='Lost Genes')
Lost Genes
ID gene-symbol gene-score gene_biotype syndromic number-of-reports
ENSG00000187951 ARHGAP11B 4 lncRNA 0 2
ENSG00000251593 MSNP1AS 2 processed_pseudogene 0 12
ENSG00000197558 SSPO 4 transcribed_unitary_pseudogene 0 3
rm(df)
if(!all(rownames(datExpr)==rownames(datGenes))) cat('!!! gene rownames do not match!!!')

to_keep = datGenes$gene_biotype=='protein_coding'
datExpr = datExpr %>% filter(to_keep)
datGenes = datGenes %>% filter(to_keep)
rownames(datExpr) = datGenes$ensembl_gene_id
rownames(datGenes) = datGenes$ensembl_gene_id

cat(paste0(length(unique(SFARI_genes$`gene-symbol`[SFARI_genes$ID %in% rownames(datExpr)])), ' SFARI genes remaining'))
## 976 SFARI genes remaining
cat(paste0('Removed ', sum(!to_keep), ' genes, ', sum(to_keep), ' remaining'))
## Removed 41134 genes, 22543 remaining


3. Filter genes with low expression levels

\(\qquad\) 3.1 Remove genes with zero expression in all of the samples

to_keep = rowSums(datExpr)>0

cat(paste0('Removed ', sum(!to_keep), ' genes, ', sum(to_keep), ' remaining'))
## Removed 3368 genes, 19175 remaining
df = data.frame('rowSums' = rowSums(datExpr), 'ensembl_gene_id' = rownames(datExpr)) %>%
     right_join(SFARI_genes, by='ensembl_gene_id') %>% filter(rowSums==0 & `gene-score` %in% c(1,2,3)) %>%
     arrange(`gene-score`) %>% dplyr::select(-ensembl_gene_id) %>% filter(!duplicated(`gene-symbol`))

kable(df %>% dplyr::select(ID, `gene-symbol`, `gene-score`, gene_biotype, syndromic, `number-of-reports`), 
      caption='Lost Genes with Top Scores')
Lost Genes with Top Scores
ID gene-symbol gene-score gene_biotype syndromic number-of-reports
ENSG00000267910 KMT2A 1 protein_coding 1 20
ENSG00000227460 SYNGAP1 1 protein_coding 1 50
ENSG00000265594 CEP41 2 protein_coding 0 5
ENSG00000272883 CNOT3 2 protein_coding 1 5
ENSG00000259938 CUX1 2 protein_coding 0 6
ENSG00000271019 MBOAT7 2 protein_coding 1 3
ENSG00000268563 MECP2 2 protein_coding 1 75
ENSG00000262024 TCF20 2 protein_coding 1 10
ENSG00000266334 APH1A 3 protein_coding 0 2
ENSG00000260508 GGNBP2 3 protein_coding 0 2
ENSG00000272706 GRIK5 3 protein_coding 0 8
ENSG00000269816 KDM5C 3 protein_coding 0 22
ENSG00000235718 MFRP 3 protein_coding 0 6
ENSG00000267946 TMLHE 3 protein_coding 0 5
ENSG00000260150 ZMYND11 3 protein_coding 0 10
datGenes = datGenes[to_keep,]
datExpr = datExpr[to_keep,]

cat(paste0(length(unique(SFARI_genes$`gene-symbol`[SFARI_genes$ID %in% rownames(datExpr)])), ' SFARI genes remaining'))
## 970 SFARI genes remaining
rm(df)

\(\qquad\) 2.2 Removing genes with a high percentage of zeros


Choosing the threshold:

Criteria for selecting the percentage of zeros threshold: The minimum value in which the preprocessed data is relatively homoscedastic (we’re trying to get rid of the group of genes with very low mean and SD that make the cloud of points look like a comic book speech bubble)

datMeta_original = datMeta
datExpr_original = datExpr
datGenes_original = datGenes
# Return to original variables
datExpr = datExpr_original
datGenes = datGenes_original
datMeta = datMeta_original

rm(datExpr_original, datGenes_original, datMeta_original, datExpr_vst, datGenes_vst, datMeta_vst)


Filtering

# Minimum percentage of non-zero entries allowed per gene
threshold = 75

plot_data = data.frame('id'=rownames(datExpr), 'non_zero_percentage' = apply(datExpr, 1, function(x) 100*mean(x>0)))

ggplotly(plot_data %>% ggplot(aes(x=non_zero_percentage)) + geom_density(color='#0099cc', fill='#0099cc', alpha=0.3) + 
         geom_vline(xintercept=threshold, color='gray') + #scale_x_log10() + 
         ggtitle('Percentage of non-zero entries distribution') + theme_minimal())
to_keep = apply(datExpr, 1, function(x) 100*mean(x>0)) >= threshold
datGenes = datGenes[to_keep,]
datExpr = datExpr[to_keep,]

cat(paste0(length(unique(SFARI_genes$`gene-symbol`[SFARI_genes$ID %in% rownames(datExpr)])), ' SFARI genes remaining'))
## 924 SFARI genes remaining
cat(paste0('Removed ', sum(!to_keep), ' genes, ', sum(to_keep), ' remaining'))
## Removed 3028 genes, 16147 remaining
rm(threshold, plot_data, to_keep)


3. Filter outlier samples

\(\qquad\) 3.1 Gandal filters samples belonging to subject AN03345 without giving an explanation. Since it could have some technical problems, I remove them as well

to_keep = (datMeta$Subject_ID != 'AN03345')
datMeta = datMeta[to_keep,]
datExpr = datExpr[,to_keep]

cat(paste0('Removed ', sum(!to_keep), ' samples, ', sum(to_keep), ' remaining'))
## Removed 2 samples, 86 remaining

\(\qquad\) 3.2 Filter out outliers: Using node connectivity as a distance measure, normalising it and filtering out genes farther away than 2 standard deviations from the left (lower connectivity than average, not higher)

absadj = datExpr %>% bicor %>% abs
netsummary = fundamentalNetworkConcepts(absadj)
ku = netsummary$Connectivity
z.ku = (ku-mean(ku))/sqrt(var(ku))

plot_data = data.frame('sample'=1:length(z.ku), 'distance'=z.ku, 'Sample_ID'=datMeta$Sample_ID, 
                       'Subject_ID'=datMeta$Subject_ID, 'Extraction_Batch'=datMeta$RNAExtractionBatch,
                       'Brain_Lobe'=datMeta$Brain_lobe, 'Sex'=datMeta$Sex, 'Age'=datMeta$Age,
                       'Diagnosis'=datMeta$Diagnosis, 'PMI'=datMeta$PMI)
selectable_scatter_plot(plot_data, plot_data[,-c(1,2)])
cat(paste0('Outlier samples: ', paste(as.character(plot_data$Sample_ID[plot_data$distance< -2]), collapse=', ')))
## Outlier samples: AN01971_BA38, AN17254_BA17, AN09714_BA38, AN01093_BA7, AN02987_BA17, AN11796_BA7
to_keep = z.ku > -2
datMeta = datMeta[to_keep,]
datExpr = datExpr[,to_keep]

cat(paste0('Removed ', sum(!to_keep), ' samples, ', sum(to_keep), ' remaining'))
## Removed 6 samples, 80 remaining
rm(absadj, netsummary, ku, z.ku, plot_data, to_keep)




Imputing values

When trying different preprocessing pipelines we noticed that removing the gene ENSG00000187951 (ARHGAP11B) created an important change in the WGCNA modules, and when analysing what could have caused this, we found that this gene had a very big outlier value for a single sample, which we think altered the values of the other genes during the vst transformation enough to alter the modules by the WGNCA algorithm (See 20_02_28_comparison.html for more details about this).

Since the original preprocessing pipeline did not catch this behaviour and it ended up causing problems in downstream analyses, we decided to add a final step to identify problematic genes and fix this errors before they can cause any problems.

The metric be found to work best for detecting this genes is \(metric_i = max_j \frac{|x_{i,j} - mean(x_i)|}{sd(x_i)}\), where \(i \in \{1, ..., n\_genes}\) and \(j \in {1, ... , n\_samples}\), (using this metric, the problematic gene had the 4th largest value of all the dataset). More details about this metric can be found in 20_03_11_z_score_outlier_methods.html.

z_scores = datExpr %>% apply(1, function(x) abs(x-mean(x))/sd(x)) %>% t %>% data.frame

max_z_scores = data.frame('ID' = rownames(z_scores), 'max_z_score' = z_scores %>% apply(1, max), 
                          'outlier_sample' = z_scores %>% apply(1, function(x) datMeta$Sample_ID[which.max(x)]))

rm(z_scores)

Outiler sample detection

It was discovered that the maximum |z-score|s of the genes tend to group around a small set of samples, which suggests the sample is faulty instead of just the measurement for the genes that had their maximum |z-score| there.

The horizontal dotted line shows the limit to the number of outliers in a single sample that are inside three standard deviations of the mean, anything above this value is considered an outlier sample and will be eliminated from the dataset.

samples_info = max_z_scores$outlier_sample %>% table %>% data.frame %>% 
               dplyr::rename(Sample_ID = '.', Count = 'Freq') %>%
               left_join(datMeta, by = 'Sample_ID')  %>% arrange(desc(Count)) %>% 
               mutate('StandardisedCount' = round((Count-mean(Count))/sd(Count),2)) %>%
               dplyr::select(Sample_ID, Count, StandardisedCount, Subject_ID, Sex, Age, Brain_lobe, Batch, Diagnosis)

ggplotly(samples_info %>% ggplot(aes(Subject_ID, Count, fill=Diagnosis, color=Brain_lobe)) + 
         geom_bar(stat = 'identity', size = 0, position = position_dodge2(preserve = 'single')) + xlab('Subjects') +
         geom_hline(yintercept = nrow(max_z_scores)/nrow(datMeta), color = 'gray') + 
         geom_hline(yintercept = mean(samples_info$Count)+3*sd(samples_info$Count), color = 'gray', linetype = 'dotted') + 
         ggtitle('Number of genes for which their maximum |z-score| value fell in each Sample') + 
         theme_minimal() + theme(axis.text.x = element_text(angle = 90, hjust = 1),
                                 legend.position = 'none'))
outlier_samples = samples_info$Sample_ID[samples_info$StandardisedCount > 3]
if(length(outlier_samples)>0){
  cat(paste0('Outlier sample(s): ', paste(outlier_samples, collapse = ', ')))
} else {
  cat('There are no outlier samples that need to be removed')
}
## Outlier sample(s): AN13872_BA4_6, AN02987_BA38
# Remove faulty samples
datExpr = datExpr %>% dplyr::select(-paste0('X',datMeta$Dissected_Sample_ID[datMeta$Sample_ID %in% outlier_samples]))
datMeta = datMeta %>% dplyr::filter(!Sample_ID %in% outlier_samples)

cat(paste0('Removed ', length(outlier_samples), ' samples, ', nrow(datMeta), ' remaining'))
## Removed 2 samples, 78 remaining
rm(samples_info, outlier_samples)

Value imputation

Recalculating max|z-score|s without the faulty samples

z_scores = datExpr %>% apply(1, function(x) abs(x-mean(x))/sd(x)) %>% t %>% data.frame

max_z_scores = data.frame('ID' = rownames(z_scores), 'max_z_score' = z_scores %>% apply(1, max), 
                          'outlier_sample' = z_scores %>% apply(1, function(x) datMeta$Sample_ID[which.max(x)]),
                          'outlier_sample_dissected_id' = z_scores %>% apply(1, function(x) datMeta$Dissected_Sample_ID[which.max(x)])) %>%
               mutate(norm_max_z_score = (max_z_score-mean(max_z_score))/sd(max_z_score),
                      has_outlier = max_z_score > mean(max_z_score) + 3*sd(max_z_score))

above_3_sd = mean(max_z_scores$max_z_score) + 3*sd(max_z_scores$max_z_score)
p = max_z_scores %>% ggplot(aes(ID, max_z_score, color=has_outlier)) + geom_point(alpha=0.2) + 
                           geom_hline(yintercept = above_3_sd, color='gray', linetype = 'dotted') + 
                           xlab('Genes') + ylab('Max |z-score|') + theme_minimal() +
                           ggtitle('Max|z-score| value for each gene') +
                           theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(),
                                 panel.grid.major = element_blank())

ggExtra::ggMarginal(p, type = 'density', margins='y', fill='gray', color = 'gray', size=10)

cat(paste0(sum(max_z_scores$has_outlier), ' genes have outlier entries'))
## 289 genes have outlier entries
rm(z_scores, above_3_sd, p)

Genes with the highest outlier scores

Note: Plots are in logarithmic scale and the horizontal line is the median

plot_function = function(i){
  i = 3*i-2
  plot_list = list()
  for(j in 1:3){
    plot_data = data.frame('sample' = colnames(datExpr), 'expr' = t(datExpr[top_genes$ID[i+j-1],]),
                           'Diagnosis' = datMeta$Diagnosis)
    colnames(plot_data)[2] = 'expr'
    plot_list[[j]] = ggplotly(plot_data %>% ggplot(aes(sample, expr+1, color=Diagnosis)) + 
                              geom_hline(yintercept = median(plot_data$expr)+1, color='gray') +
                              geom_point() + theme_minimal() + scale_y_log10() +
                              theme(legend.position='none', axis.text.x=element_blank(), axis.ticks.x=element_blank()))
  }
  p = subplot(plot_list, nrows=1) %>% layout(annotations = list(
    list(x = 0.1 , y = 1.05, text =  top_genes$hgnc_symbol[i],  showarrow = F, xref='paper', yref='paper'),
    list(x = 0.5 , y = 1.05, text = top_genes$hgnc_symbol[i+1], showarrow = F, xref='paper', yref='paper'),
    list(x = 0.9 , y = 1.05, text = top_genes$hgnc_symbol[i+2], showarrow = F, xref='paper', yref='paper')))
  
  return(p)
}

top_genes = max_z_scores %>% dplyr::top_n(n=21, w=max_z_score) %>% arrange(desc(max_z_score)) %>% 
            left_join(datGenes, by=c('ID' = 'ensembl_gene_id')) %>%
            dplyr::select(ID, hgnc_symbol,  max_z_score, norm_max_z_score)

plot_function(1)
plot_function(2)
plot_function(3)
rm(plot_function, top_genes)

Imputed value: The average expression level in that specific gene from the sample’s Diagnosis group (in case the gene has different expression levels by Diagnosis)

  • Since the maximum value we just imputed could be hiding other smaller outlier values underneath it, after its value is imputed, the max|z-score| of the gene is recalculated, and if this value remains above the outlier threshold, the outlier entry is also imputed. This is repeated until there are no more outlier entries.
outliers = which(max_z_scores$has_outlier)

max_z_scores$imputed_values = 0

for(i in outliers){
  max_score = abs(max(abs(datExpr[i,]-mean(datExpr[i,]))/sd(datExpr[i,]))-mean(max_z_scores$max_z_score))/sd(max_z_scores$max_z_score)
  
  while(max_score > 3){
    sample = gsub('X','',names(which.max(abs(datExpr[i,]-mean(datExpr[i,]))/sd(datExpr[i,]))))
    diagnosis = datMeta$Diagnosis[datMeta$Dissected_Sample_ID == sample] %>% as.character
    columns = datMeta$Dissected_Sample_ID[datMeta$Diagnosis == diagnosis & datMeta$Dissected_Sample_ID != sample]
    datExpr[i,paste0('X',sample)] = datExpr[i, paste0('X',columns)] %>% mean %>% round
    
    max_z_scores$imputed_values[i] = max_z_scores$imputed_values[i] + 1
    
    # Prepare for next round
    max_score = abs(max(abs(datExpr[i,]-mean(datExpr[i,]))/sd(datExpr[i,]))-mean(max_z_scores$max_z_score))/sd(max_z_scores$max_z_score)
  }
}

cat('Number of imputed values in each gene')
## Number of imputed values in each gene
table(max_z_scores$imputed_values)
## 
##     0     1     2     3     4 
## 15858   267    12     7     3
rm(i, outliers, max_score, sample, diagnosis, columns)

Samples that originally had the largest outlier values, after correction

plot_function = function(i){
  i = 3*i-2
  plot_list = list()
  for(j in 1:3){
    plot_data = data.frame('sample' = colnames(datExpr), 'expr' = t(datExpr[top_genes$ID[i+j-1],]),
                           'Diagnosis' = datMeta$Diagnosis)
    colnames(plot_data)[2] = 'expr'
    plot_list[[j]] = ggplotly(plot_data %>% ggplot(aes(sample, expr+1, color=Diagnosis)) + 
                              geom_hline(yintercept = median(plot_data$expr)+1, color='gray') +
                              geom_point() + theme_minimal() + scale_y_log10() +
                              theme(legend.position='none', axis.text.x=element_blank(), axis.ticks.x=element_blank()))
  }
  p = subplot(plot_list, nrows=1) %>% layout(annotations = list(
    list(x = 0.1 , y = 1.05, text =  top_genes$hgnc_symbol[i],  showarrow = F, xref='paper', yref='paper'),
    list(x = 0.5 , y = 1.05, text = top_genes$hgnc_symbol[i+1], showarrow = F, xref='paper', yref='paper'),
    list(x = 0.9 , y = 1.05, text = top_genes$hgnc_symbol[i+2], showarrow = F, xref='paper', yref='paper')))
  
  return(p)
}

top_genes = max_z_scores %>% dplyr::top_n(n=21, w=max_z_score) %>% arrange(desc(max_z_score)) %>% 
            left_join(datGenes, by=c('ID' = 'ensembl_gene_id')) %>%
            dplyr::select(ID, hgnc_symbol,  max_z_score, norm_max_z_score)

plot_function(1)
plot_function(2)
plot_function(3)
rm(plot_function, top_genes)
cat(paste0('After filtering, the dataset consists of ', nrow(datExpr), ' genes and ', ncol(datExpr), ' samples'))
## After filtering, the dataset consists of 16147 genes and 78 samples




Batch Effects

According to Tackling the widespread and critical impact of batch effects in high-throughput data, technical artifacts can be an important source of variability in the data, so batch correction should be part of the standard preprocessing pipeline of gene expression data.

They say Processing group and Date of the experiment are good batch surrogates, so I’m going to see if they affect the data in any clear way to use them as surrogates.

Processing group

All the information we have is the Brain Bank, and although all the samples were obtained from the Autism Tissue Project, we don’t have any more specific information about who preprocessed each sample

table(datMeta$Brain_Bank)
## 
## ATP 
##  78


Date of processing

There are two different dates when the data was procesed

table(datMeta$RNAExtractionBatch)
## 
## 10/10/2014  6/20/2014 
##         52         26

Luckily, there doesn’t seem to be a correlation between the batch surrogate and the objective variable, so the batch effect will not get confused with the Diagnosis effect

table(datMeta$RNAExtractionBatch, datMeta$Diagnosis)
##             
##              CTL ASD
##   10/10/2014  24  28
##   6/20/2014   11  15

*All the samples from each subject were processed on the same day

Samples don’t seem to cluster together that strongly for each batch, although there does seem to be some kind of relation

h_clusts = datExpr %>% t %>% dist %>% hclust %>% as.dendrogram

create_viridis_dict = function(){
  min_age = datMeta$Age %>% min
  max_age = datMeta$Age %>% max
  viridis_age_cols = viridis(max_age - min_age + 1)
  names(viridis_age_cols) = seq(min_age, max_age)
  
  return(viridis_age_cols)
}
viridis_age_cols = create_viridis_dict()

dend_meta = datMeta[match(substring(labels(h_clusts),2), datMeta$Dissected_Sample_ID),] %>% 
            mutate('Batch' = ifelse(RNAExtractionBatch=='10/10/2014', '#F8766D', '#00BFC4'),
                   'Diagnosis' = ifelse(Diagnosis=='CTL','#008080','#86b300'), # Blue control, Green ASD
                   'Sex' = ifelse(Sex=='F','#ff6666','#008ae6'),                # Pink Female, Blue Male
                   'Region' = case_when(Brain_lobe=='Frontal'~'#F8766D',        # ggplot defaults for 4 colours
                                        Brain_lobe=='Temporal'~'#7CAE00',
                                        Brain_lobe=='Parietal'~'#00BFC4',
                                        Brain_lobe=='Occipital'~'#C77CFF'),
                   'Age' = viridis_age_cols[as.character(Age)]) %>%             # Purple: young, Yellow: old
            dplyr::select(Age, Region, Sex, Diagnosis, Batch)
h_clusts %>% as.dendrogram %>% dendextend::set('labels', rep('', nrow(datMeta))) %>% 
             dendextend::set('branches_k_color', k=9) %>% plot
colored_bars(colors=dend_meta)

rm(h_clusts, dend_meta, create_viridis_dict, viridis_age_cols)

Comparing the mean expression of each sample by batch we can see there is some batch effect differentiating them

plot_data_b1 = data.frame('Mean'=colMeans(datExpr[,datMeta$RNAExtractionBatch=='10/10/2014']), 'Batch'='10/10/2014')
plot_data_b2 = data.frame('Mean'=colMeans(datExpr[,datMeta$RNAExtractionBatch=='6/20/2014']), 'Batch'='6/20/2014')

plot_data = rbind(plot_data_b1, plot_data_b2)
mu = plot_data %>% group_by(Batch) %>% dplyr::summarise(BatchMean=mean(Mean))

ggplotly(plot_data %>% ggplot(aes(x=Mean, color=Batch, fill=Batch)) + geom_density(alpha=0.3) + 
         geom_vline(data=mu, aes(xintercept=BatchMean, color=Batch), linetype='dashed') +
         ggtitle('Mean expression by sample grouped by Batch') + scale_x_log10() + theme_minimal())
rm(plot_data_b1, plot_data_b2, plot_data, mu)


Looking for unknown sources of batch effects

Following the pipeline from Surrogate variable analysis: hidden batch effects where sva is used with DESeq2.

Create a DeseqDataSet object, estimate the library size correction and save the normalized counts matrix

counts = datExpr %>% as.matrix
rowRanges = GRanges(datGenes$chromosome_name,
                  IRanges(datGenes$start_position, width=datGenes$length),
                  strand=datGenes$strand,
                  feature_id=datGenes$ensembl_gene_id)
se = SummarizedExperiment(assays=SimpleList(counts=counts), rowRanges=rowRanges, colData=datMeta)
dds = DESeqDataSet(se, design = ~ Diagnosis)
## converting counts to integer mode
dds = estimateSizeFactors(dds)
norm.cts = counts(dds, normalized=TRUE)

Provide the normalized counts and two model matrices to SVA. The first matrix uses the biological condition, and the second model matrix is the null model.

mod = model.matrix(~ Diagnosis, colData(dds))
mod0 = model.matrix(~ 1, colData(dds))
sva_fit = svaseq(norm.cts, mod=mod, mod0=mod0)
## Number of significant surrogate variables is:  13 
## Iteration (out of 5 ):1  2  3  4  5
rm(mod, mod0, norm.cts)

Found 13 surrogate variables, since there is no direct way to select which ones to pick Bioconductor answer, kept all of them.

Include SV estimations to datMeta information

sv_data = sva_fit$sv %>% data.frame
colnames(sv_data) = paste0('SV',1:ncol(sv_data))

datMeta_sva = cbind(datMeta, sv_data)

rm(sv_data, sva_fit)

In conclusion: Date of extraction works as a surrogate for batch effect and the sva package found other 13 variables that could work as surrogates which are now included in datMeta and should be included in the DEA.




Normalisation and Differential Expression Analysis

Using DESeq2 package to perform normalisation. I chose this package over limma because limma uses the log transformed data as input instead of the raw counts and I have discovered that in this dataset, this transformation affects genes differently depending on their mean expression level, and genes with a high SFARI score are specially affected by this.

plot_data = data.frame('ID'=rownames(datExpr), 'Mean'=rowMeans(datExpr), 'SD'=apply(datExpr,1,sd))

plot_data %>% ggplot(aes(Mean, SD)) + geom_point(color='#0099cc', alpha=0.1) + geom_abline(color='gray') +
              scale_x_log10() + scale_y_log10() + theme_minimal()

rm(plot_data)
counts = datExpr %>% as.matrix
rowRanges = GRanges(datGenes$chromosome_name,
                  IRanges(datGenes$start_position, width=datGenes$length),
                  strand=datGenes$strand,
                  feature_id=datGenes$ensembl_gene_id)
se = SummarizedExperiment(assays=SimpleList(counts=counts), rowRanges=rowRanges, colData=datMeta_sva)
dds = DESeqDataSet(se, design = ~ Batch + SV1 + SV2 + SV3 + SV4 + SV5 + SV6 + SV7 + SV8 + SV9 + 
                                  SV10 + SV11 + SV12 + SV13 + Diagnosis)
## converting counts to integer mode
# Perform DEA
dds = DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
DE_info = results(dds, lfcThreshold=0, altHypothesis='greaterAbs')

# Perform vst
vsd = vst(dds)

datExpr_vst = assay(vsd)
datMeta_vst = colData(vsd)
datGenes_vst = rowRanges(vsd)

rm(counts, rowRanges, se, vsd)

Using the plotting function DESEq2’s manual proposes to study vst’s output it looks like the data could be homoscedastic

meanSdPlot(datExpr_vst, plot=FALSE)$gg + theme_minimal()

When plotting point by point it seems like the group of genes with the lowest values behave differently to the rest

plot_data = data.frame('ID'=rownames(datExpr_vst), 'Mean'=rowMeans(datExpr_vst), 'SD'=apply(datExpr_vst,1,sd))

plot_data %>% ggplot(aes(Mean, SD)) + geom_point(color='#0099cc', alpha=0.2) + 
              scale_x_log10() + scale_y_log10() + theme_minimal()

rm(plot_data)




Save filtered and annotated dataset

*Could have done this since before

save(datExpr, datMeta, datGenes, file='./../Data/filtered_raw_data_imputed.RData')
#load('./../Data/filtered_raw_data.RData')

Rename normalised datasets to continue working with these

datExpr = datExpr_vst
datMeta = datMeta_vst %>% data.frame
datGenes = datGenes_vst

cat(paste0(length(unique(SFARI_genes$`gene-symbol`[SFARI_genes$ID %in% rownames(datExpr)])), ' SFARI genes remaining'))
## 924 SFARI genes remaining
cat(paste0('After filtering, the dataset consists of ', nrow(datExpr), ' genes and ', ncol(datExpr), ' samples'))
## After filtering, the dataset consists of 16147 genes and 78 samples
rm(datExpr_vst, datMeta_vst, datGenes_vst, datMeta_sva)




Batch Effect Correction

By including the surrogate variables in the DESeq formula we only modelled the batch effects into the DEA, but we didn’t actually correct them from the data, for that we need to use ComBat (or other equivalent package) in the already normalised data

SVA surrogate variables

In some places they say you shouldn’t correct these effects on the data because you risk losing biological variation, in others they say you should because they introduce noise to the data. The only thing everyone agrees on is that you shouldn’t remove them before performing DEA but instead include them in the model.

Based on the conclusions from Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis it seems like it may be a good idea to remove the batch effects from the data and not only from the DE analysis:

  • Using SVA, ComBat or related tools can increase the power to identify specific signals in complex genomic datasets (they found “greatly sharpened global and gene-specific differential expression across treatment groups”)

  • But caution should be exercised to avoid removing biological signal of interest

  • We must be precise and deliberate in the design and analysis of experiments and the resulting data, and also mindful of the limitations we impose with our own perspective

  • Open data exploration is not possible after such supervised “cleaning”, because effects beyond those stipulated by the researcher may have been removed

Comparing data with and without surrogate variable correction

# Taken from https://www.biostars.org/p/121489/#121500
correctDatExpr = function(datExpr, mod, svs) {
  X = cbind(mod, svs)
  Hat = solve(t(X) %*% X) %*% t(X)
  beta = (Hat %*% t(datExpr))
  rm(Hat)
  gc()
  P = ncol(mod)
  return(datExpr - t(as.matrix(X[,-c(1:P)]) %*% beta[-c(1:P),]))
}

pca_samples_before = datExpr %>% t %>% prcomp
pca_genes_before = datExpr %>% prcomp

# Correct
mod = model.matrix(~ Diagnosis, colData(dds))
svs = datMeta %>% dplyr::select(SV1:SV13) %>% as.matrix
datExpr_corrected = correctDatExpr(as.matrix(datExpr), mod, svs)

pca_samples_after = datExpr_corrected %>% t %>% prcomp
pca_genes_after = datExpr_corrected %>% prcomp

rm(correctDatExpr)

Samples

Removing batch effects has a big impact in the distribution of the samples, separating them by diagnosis almost perfectly just using the first principal component

pca_samples_df = rbind(data.frame('ID'=colnames(datExpr), 'PC1'=pca_samples_before$x[,1],
                                  'PC2'=pca_samples_before$x[,2], 'corrected'=0),
                       data.frame('ID'=colnames(datExpr), 'PC1'=pca_samples_after$x[,1],
                                  'PC2'=pca_samples_after$x[,2], 'corrected'=1)) %>%
                 left_join(datMeta %>% mutate('ID'=rownames(datMeta)), by='ID')

ggplotly(pca_samples_df %>% ggplot(aes(PC1, PC2, color=Diagnosis)) + geom_point(aes(frame=corrected, id=ID), alpha=0.75) + 
         xlab(paste0('PC1 (corr=', round(cor(pca_samples_before$x[,1],pca_samples_after$x[,1]),2),
                     '). % Var explained: ', round(100*summary(pca_samples_before)$importance[2,1],1),' to ',
                     round(100*summary(pca_samples_after)$importance[2,1],1))) +
         ylab(paste0('PC2 (corr=', round(cor(pca_samples_before$x[,2],pca_samples_after$x[,2]),2),
                     '). % Var explained: ', round(100*summary(pca_samples_before)$importance[2,2],1),' to ',
                     round(100*summary(pca_samples_after)$importance[2,2],1))) +
         ggtitle('Samples') + theme_minimal())
rm(pca_samples_df)


Genes

It seems like the sva correction preserves the mean expression of the genes and erases almost everything else (although what little else remains is enough to characterise the two Diagnosis groups pretty well using only the first PC)

*Plot is done with only 10% of the genes so it’s not that heavy

pca_genes_df = rbind(data.frame('ID'=rownames(datExpr), 'PC1'=pca_genes_before$x[,1],
                                'PC2'=pca_genes_before$x[,2], 'corrected'=0, 'MeanExpr'=rowMeans(datExpr)),
                     data.frame('ID'=rownames(datExpr), 'PC1'=pca_genes_after$x[,1],
                                'PC2'=pca_genes_after$x[,2], 'corrected'=1, 'MeanExpr'=rowMeans(datExpr)))

keep_genes = rownames(datExpr) %>% sample(0.1*nrow(datExpr))

pca_genes_df = pca_genes_df %>% filter(ID %in% keep_genes)

ggplotly(pca_genes_df %>% ggplot(aes(PC1, PC2,color=MeanExpr)) + geom_point(alpha=0.3, aes(frame=corrected, id=ID)) +
         xlab(paste0('PC1 (corr=', round(cor(pca_genes_before$x[,1],pca_genes_after$x[,1]),2),
                     '). % Var explained: ', round(100*summary(pca_genes_before)$importance[2,1],1),' to ',
                     round(100*summary(pca_genes_after)$importance[2,1],1))) +
         ylab(paste0('PC2 (corr=', round(cor(pca_genes_before$x[,2],pca_genes_after$x[,2]),2),
                     '). % Var explained: ', round(100*summary(pca_genes_before)$importance[2,2],1),' to ',
                     round(100*summary(pca_genes_after)$importance[2,2],1))) +
         scale_color_viridis() + ggtitle('Genes') + theme_minimal())
rm(pca_samples_before, pca_genes_before, mod, svs, pca_samples_after, pca_genes_after, pca_genes_df, keep_genes)

Everything looks good, so we’re keeping the corrected expression dataset

datExpr = datExpr_corrected

rm(datExpr_corrected)


Processing date

Even after correcting the dataset for the surrogate variables found with sva, there is still a difference in mean expression by processing date (although this difference is relatively small)

plot_data_b1 = data.frame('Mean'=colMeans(datExpr[,datMeta$RNAExtractionBatch=='10/10/2014']), 'Batch'='10/10/2014')
plot_data_b2 = data.frame('Mean'=colMeans(datExpr[,datMeta$RNAExtractionBatch=='6/20/2014']), 'Batch'='6/20/2014')

plot_data = rbind(plot_data_b1, plot_data_b2)
mu = plot_data %>% group_by(Batch) %>% dplyr::summarise(BatchMean=mean(Mean))

ggplotly(plot_data %>% ggplot(aes(x=Mean, color=Batch, fill=Batch)) + geom_density(alpha=0.3) + 
         geom_vline(data=mu, aes(xintercept=BatchMean, color=Batch), linetype='dashed') +
         ggtitle('Mean expression by sample grouped by processing date') + scale_x_log10() + theme_minimal())
rm(plot_data_b1, plot_data_b2, plot_data, mu)


Performing Batch Correction for processing date

https://support.bioconductor.org/p/50983/

datExpr = datExpr %>% as.matrix %>% ComBat(batch=datMeta$Batch)
## Found2batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data

Now both batches have the same mean expression

plot_data_b1 = data.frame('Mean'=colMeans(datExpr[,datMeta$RNAExtractionBatch=='10/10/2014']), 'Batch'='10/10/2014')
plot_data_b2 = data.frame('Mean'=colMeans(datExpr[,datMeta$RNAExtractionBatch=='6/20/2014']), 'Batch'='6/20/2014')

plot_data = rbind(plot_data_b1, plot_data_b2)
mu = plot_data %>% group_by(Batch) %>% dplyr::summarise(BatchMean=mean(Mean))

ggplotly(plot_data %>% ggplot(aes(x=Mean, color=Batch, fill=Batch)) + geom_density(alpha=0.3) + 
         geom_vline(data=mu, aes(xintercept=BatchMean, color=Batch), linetype='dashed') +
         ggtitle('Mean expression by sample grouped by processing date') + scale_x_log10() + theme_minimal())
rm(plot_data_b1, plot_data_b2, plot_data, mu)



Save preprocessed dataset

save(datExpr, datMeta, datGenes, DE_info, dds, file='./../Data/preprocessed_data_imputed.RData')
#load('./../Data/preprocessed_data.RData')




Session info

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] knitr_1.28                  expss_0.10.2               
##  [3] dendextend_1.13.4           vsn_3.52.0                 
##  [5] WGCNA_1.69                  fastcluster_1.1.25         
##  [7] dynamicTreeCut_1.63-1       sva_3.32.1                 
##  [9] genefilter_1.66.0           mgcv_1.8-31                
## [11] nlme_3.1-144                DESeq2_1.24.0              
## [13] SummarizedExperiment_1.14.1 DelayedArray_0.10.0        
## [15] BiocParallel_1.18.1         matrixStats_0.56.0         
## [17] Biobase_2.44.0              GenomicRanges_1.36.1       
## [19] GenomeInfoDb_1.20.0         IRanges_2.18.3             
## [21] S4Vectors_0.22.1            BiocGenerics_0.30.0        
## [23] biomaRt_2.40.5              ggpubr_0.2.5               
## [25] magrittr_1.5                ggExtra_0.9                
## [27] GGally_1.5.0                gridExtra_2.3              
## [29] viridis_0.5.1               viridisLite_0.3.0          
## [31] RColorBrewer_1.1-2          plotlyutils_0.0.0.9000     
## [33] plotly_4.9.2                glue_1.3.2                 
## [35] reshape2_1.4.3              forcats_0.5.0              
## [37] stringr_1.4.0               dplyr_0.8.5                
## [39] purrr_0.3.3                 readr_1.3.1                
## [41] tidyr_1.0.2                 tibble_3.0.0               
## [43] ggplot2_3.3.0               tidyverse_1.3.0            
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1           backports_1.1.5        Hmisc_4.4-0           
##   [4] plyr_1.8.6             lazyeval_0.2.2         splines_3.6.3         
##   [7] crosstalk_1.1.0.1      digest_0.6.25          foreach_1.5.0         
##  [10] htmltools_0.4.0        GO.db_3.8.2            fansi_0.4.1           
##  [13] checkmate_2.0.0        memoise_1.1.0          doParallel_1.0.15     
##  [16] cluster_2.1.0          limma_3.40.6           annotate_1.62.0       
##  [19] modelr_0.1.6           prettyunits_1.1.1      jpeg_0.1-8.1          
##  [22] colorspace_1.4-1       blob_1.2.1             rvest_0.3.5           
##  [25] haven_2.2.0            xfun_0.12              hexbin_1.28.1         
##  [28] crayon_1.3.4           RCurl_1.98-1.1         jsonlite_1.6.1        
##  [31] impute_1.58.0          iterators_1.0.12       survival_3.1-11       
##  [34] gtable_0.3.0           zlibbioc_1.30.0        XVector_0.24.0        
##  [37] scales_1.1.0           DBI_1.1.0              miniUI_0.1.1.1        
##  [40] Rcpp_1.0.4             xtable_1.8-4           progress_1.2.2        
##  [43] htmlTable_1.13.3       foreign_0.8-75         bit_1.1-15.2          
##  [46] preprocessCore_1.46.0  Formula_1.2-3          htmlwidgets_1.5.1     
##  [49] httr_1.4.1             acepack_1.4.1          ellipsis_0.3.0        
##  [52] farver_2.0.3           pkgconfig_2.0.3        reshape_0.8.8         
##  [55] XML_3.99-0.3           nnet_7.3-13            dbplyr_1.4.2          
##  [58] locfit_1.5-9.4         labeling_0.3           tidyselect_1.0.0      
##  [61] rlang_0.4.5            later_1.0.0            AnnotationDbi_1.46.1  
##  [64] munsell_0.5.0          cellranger_1.1.0       tools_3.6.3           
##  [67] cli_2.0.2              generics_0.0.2         RSQLite_2.2.0         
##  [70] broom_0.5.5            evaluate_0.14          fastmap_1.0.1         
##  [73] yaml_2.2.1             bit64_0.9-7            fs_1.4.0              
##  [76] mime_0.9               xml2_1.2.5             compiler_3.6.3        
##  [79] rstudioapi_0.11        curl_4.3               png_0.1-7             
##  [82] affyio_1.54.0          ggsignif_0.6.0         reprex_0.3.0          
##  [85] geneplotter_1.62.0     stringi_1.4.6          highr_0.8             
##  [88] lattice_0.20-40        Matrix_1.2-18          vctrs_0.2.4           
##  [91] pillar_1.4.3           lifecycle_0.2.0        BiocManager_1.30.10   
##  [94] cowplot_1.0.0          data.table_1.12.8      bitops_1.0-6          
##  [97] httpuv_1.5.2           affy_1.62.0            R6_2.4.1              
## [100] latticeExtra_0.6-29    promises_1.1.0         codetools_0.2-16      
## [103] assertthat_0.2.1       withr_2.1.2            GenomeInfoDbData_1.2.1
## [106] hms_0.5.3              grid_3.6.3             rpart_4.1-15          
## [109] rmarkdown_2.1          shiny_1.4.0.2          lubridate_1.7.4       
## [112] base64enc_0.1-3